Efficient Text Classification Using Term Projection
نویسندگان
چکیده
In this paper, we propose an efficient text classification method using term projection. Firstly, we use a modified χ statistic to project terms into predefined categories, which is more efficient compared to other clustering methods. Afterwards, we utilize the generated clusters as features to represent the documents. The classification is then performed in a rule-based manner or via SVM. Experiment results show that our modified χ statistic feature selection method outperforms traditional χ statistic especially at lower dimensionalities. And our method is also more efficient than Latent Semantic Analysis (LSA) on homogeneous dataset. Meanwhile, we can reduce the feature dimensionality by three orders of magnitude to save training and testing cost, and maintain comparable accuracy. Moreover, we could use a small training set to gain an approximately 4.3% improvement on heterogeneous dataset as compared to traditional method, which indicates that our method has better generalization capability.
منابع مشابه
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
متن کاملTopic Modeling and Classification of Cyberspace Papers Using Text Mining
The global cyberspace networks provide individuals with platforms to can interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and many more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...
متن کاملImproving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...
متن کاملA New Document Embedding Method for News Classification
Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
متن کاملA Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
متن کامل